This is a petition concerning your notice of abandonment of Application No. 
09/647,300, Art Unit 2655, which you mailed on June 16, 2004.

Your notice of abandonment states that "No reply was received to your 
letter of 30 October 2003". I wish to bring to your attention that I mailed 
two answers to that letter to your patent office, in December 2003 and 
January 2004 respectively, sent a further answer by fax in January 2004, and 
made several telephone inquiries about the status of my answers to supervisor 
Doris To.

I attach photocopies of the mailing receipts with the corresponding tracking 
numbers. I also attach the other documents that were requested in your letter, 
as I did in the previous answers:

• marked up version of the description with changes 

• clean version of the description with changes 

• fixed version of the claims 

• summary 

• disclosure of an article 

• copy of the article 

Below I also include the content of my previous answer to you of 
January 17, which you might not have received:

Answer to your recent note concerning the patent application 09/647,300, Art 
Unit 2655, Examiner Daniel Demelash Abebe 

This is an answer to your previous note related to the International applica- 
tion PCT/IB00/00189, US National stage 09/647,300, Art Unit 2655, Examiner 
Daniel Demelash Abebe. 

• About the patent application RO-C99-00214 that the examiner cannot 
locate: to my knowledge, a copy of it was transmitted to you 
by the PCT office in Geneva (please verify the documents transmitted to 
you from PCT IB in Geneva).

• I also attach an Information Disclosure Statement filled in with the changes 
that you recommended in your previous note. I attach several versions, 
since I was not sure how to interpret some of your markings 
in the previous note.

• The description and claims were numbered and reformatted with double 
spacing.

• The two markups (on pages 5 and 6 respectively) were reverted to the previous 
version. There was no markup on page 10. A substitute specification with 
amendments is submitted. The reference to the parent application is also 
added at the beginning of the specification. Both a clean and a marked-up 
version are included, 37 CFR 1.125(b)(2), 37 CFR 1.125(c).

• An amendment was made to each claim, by canceling and 
re-presenting/rewriting it. The claims were formulated in the 
specified one-sentence form. The translation is improved by correcting 
the spelling, grammar, and idiomatic errors that were mentioned.

Statement conforming to 37 CFR 1.125(b)(1): The substitute specification 
contains no new matter.

I thank you in advance 



Marius Calin Silaghi 
SPECIFICATION



TITLE OF THE INVENTION

Speech Recognition and Signal Analysis by Exact Fast Search of Subsequences with
Maximal Confidence Measure

REFERENCE TO APPENDIX SUBMITTED ON CD 
Not Applicable 


CROSS-REFERENCE TO RELATED APPLICATION 

This patent application has as parent application the patent application C99-00214/25.02.1999
registered with the State Office for Inventions and Trademarks (OSIM) in Bucharest,
Romania. The present application is the US national stage of the international application
PCT/IB00/00189 registered with the International Patent Office in Geneva.

BACKGROUND OF THE INVENTION 

FIELD OF THE INVENTION

The invention relates to a common component of Speech Recognition, more particularly
to the fields of Keyword Spotting and decoding, Segments Alignment for DNA and proteins,
and Recognition of Objects in Images.
DESCRIPTION OF THE RELATED ART

This invention addresses the problem of keyword spotting (KWS) in unconstrained speech
without explicit modeling of non-keyword segments (typically done by using filler HMM
models or an ergodic HMM composed of context-dependent or independent phone models
without lexical constraints). Several methods (sometimes referred to as sliding model
methods) tackling this type of problem have already been proposed in the past. For example,
they use Dynamic Time Warping (DTW) or Viterbi matching allowing relaxation of the (begin
and endpoint) constraints. These are known to require the use of an appropriate normalization
of the matching scores since segments of different lengths then have to be compared. However,
given this normalization and the relaxation of begin/endpoints, straightforward Dynamic
Programming (DP) is no longer optimal (or, in other words, the DP optimality principle is
no longer valid) and has to be adapted, involving more memory and CPU. Indeed, at any
possible ending time e, the match score of the best warp and start time b of the reference has
to be computed (for all possible start times b associated with unpruned paths). Finally, this
adapted DP quickly becomes even more complex (or intractable) for more advanced scoring
criteria (such as the confidence measures mentioned below).

Work in the field of confidence levels, within the framework of hybrid HMM/ANN systems,
has shown that the use of accumulated local posterior probabilities (as obtained at the
output of a multilayer perceptron) normalized by the length of the word segment (or, better,
involving a double normalization over the number of phones and the number of acoustic
frames in each phone) yields good confidence measures and good scores for the
re-estimation of N-best hypotheses. However, so far the evaluation of such confidence measures
involved the estimation and rescoring of N-best hypotheses.

KWS methods without filler models have in common the selection of a subsequence of
the utterance to match the keyword models of interest. Let $X = \{x_1, x_2, \ldots, x_N\}$
denote the sequence of acoustic vectors in which we want to detect a keyword, and let $M$
be the HMM model of a keyword, consisting of $L$ states $Q = \{q_1, q_2, \ldots, q_L\}$.
Assuming that $M$ is matched to a subsequence $X_b^e = \{x_b, \ldots, x_e\}$ ($1 \le b \le e \le N$) of $X$,
and that we have an implicit (not modeled) garbage/filler state $q_G$ preceding and following
$M$, one can define (approximate) the log posterior of a model $M$ given a subsequence $X_b^e$ as
the average posterior probability along the optimal path, i.e.:

$$-\log P(M|X_b^e) \simeq \frac{1}{e-b+1} \min_{Q \in M} -\log P(Q|X_b^e)$$

$$= \frac{1}{e-b+1} \min_{Q \in M} \Big\{ -\log P(q^b|q_G) - \sum_{n=b}^{e-1} \big[ \log P(q^n|x_n) + \log P(q^{n+1}|q^n) \big] - \log P(q^e|x_e) - \log P(q_G|q^e) \Big\} \qquad (1)$$

where $Q = \{q^b, q^{b+1}, \ldots, q^e\}$ represents one of the possible paths of length $(e-b+1)$ in $M$, and
$q^n$ the HMM state visited at time $n$ along $Q$, with $q^n \in \{q_1, \ldots, q_L\}$. In this expression, $q_G$ represents
the garbage (filler) state which is simply used here as the non-emitting initial and final state
of $M$. Transition probabilities $P(q^b|q_G)$ and $P(q_G|q^e)$ can be interpreted as the keyword
entrance and exit penalties, but can be simply set to 1. Local posteriors $P(q_k|x_n)$ can be
estimated using any of the known techniques: multi-Gaussians, code-books, or as output
values of a multilayer perceptron (MLP) used in hybrid HMM/ANN systems. For a specific
subsequence, expression (1) can easily be estimated by dynamic programming since the
subsequence and the associated normalizing factor $(e-b+1)$ are given. However, in the
case of keyword spotting, this expression should be estimated for all possible begin/endpoint
pairs $\{b, e\}$ (as well as for all possible word models), and we define the matching score of $X$
on $M$ as:

$$S(M|X) = -\log P(M|X_{b^*}^{e^*}) \qquad (2)$$

where the optimal begin/endpoints $\{b^*, e^*\}$, and the associated optimal path $Q^*$, are the
ones yielding the lowest average local posterior:

$$(Q^*, b^*, e^*) = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{e-b+1} \log P(Q|X_b^e) \qquad (3)$$
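
To make the cost of evaluating criterion (3) directly more concrete, the following Python sketch runs a separate DP for every begin/endpoint pair and keeps the pair with the lowest length-normalized score. It is only an illustration under stated assumptions: a strictly left-to-right keyword model, all transition probabilities set to 1, and a hypothetical input layout `cost[n][s]` holding $-\log P(q_s|x_n)$. The function names are not part of the specification.

```python
import math

def segment_cost(cost, b, e):
    """Minimum summed -log posterior of a left-to-right path over frames b..e.

    Assumption: the path starts in state 0 at frame b, ends in the last
    state at frame e, and transitions cost nothing (probability 1).
    """
    L = len(cost[0])
    prev = [math.inf] * L
    prev[0] = cost[b][0]
    for n in range(b + 1, e + 1):
        cur = [math.inf] * L
        for s in range(L):
            best = prev[s] if s == 0 else min(prev[s], prev[s - 1])
            cur[s] = best + cost[n][s]
        prev = cur
    return prev[L - 1]

def exhaustive_match(cost):
    """Evaluate criterion (3) by brute force over all (b, e) pairs."""
    N = len(cost)
    best = (math.inf, -1, -1)  # (normalized score, b, e), 0-based inclusive
    for b in range(N):
        for e in range(b, N):
            total = segment_cost(cost, b, e)
            if total < math.inf:
                best = min(best, (total / (e - b + 1), b, e))
    return best
```

Every one of the O(N^2) begin/endpoint pairs triggers its own DP pass here, which is the overhead the methods described later aim to avoid.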

Of course, in the case of several keywords, all possible models will have to be evaluated.
A double averaging involving the number of frames per phone and the number of phones
usually yields slightly better performance when used to rescore N-best candidates:

$$(Q^*, b^*, e^*) = \operatorname*{argmin}_{\{Q, b, e\}} \frac{-1}{J} \sum_{j=1}^{J} \left( \frac{1}{e_j - b_j + 1} \sum_{n=b_j}^{e_j} \log P(q_j^n|x_n) \right) \qquad (4)$$

where $J$ represents the number of phones in the hypothesized keyword model and $q_j^n$ the
hypothesized phone $q_j$ for input frame $x_n$. However, given the time normalization and
the relaxation of begin/endpoints, straightforward DP is no longer optimal and has to be
adapted, usually involving more memory and CPU.
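
As a small worked example of the double normalization in (4), the Python sketch below evaluates the criterion for one given segmentation (it does not perform the argmin). The layout `log_post[n][j]` for $\log P(q_j|x_n)$ and the list of per-phone `(b_j, e_j)` boundaries are assumptions made for illustration only.

```python
def double_normalized_score(log_post, segments):
    """Value of criterion (4) for one fixed segmentation.

    log_post[n][j]: log P(q_j | x_n) (hypothetical layout).
    segments: one (b_j, e_j) pair per phone j, 0-based and inclusive.
    """
    J = len(segments)
    total = 0.0
    for j, (b, e) in enumerate(segments):
        phone_sum = sum(log_post[n][j] for n in range(b, e + 1))
        total += phone_sum / (e - b + 1)  # average over the frames of phone j
    return -total / J                     # average over the phones

# Two phones: phone 0 on frames 0-1, phone 1 on frame 2.
example = [[-0.2, -3.0], [-0.4, -2.0], [-1.5, -0.6]]
score = double_normalized_score(example, [(0, 1), (2, 2)])
```

Each phone contributes its own frame average, so a short phone weighs as much as a long one in the final score.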

Filler-based KWS needs a simpler decoding step. Although various solutions have been
proposed towards the direct optimization of (2), most of the keyword spotting approaches
today prefer to preserve the optimality and simplicity of Viterbi DP by modeling the complete
input and explicitly or implicitly modeling non-keyword segments by using so-called filler or
garbage models as additional reference models. In this case, we assume that non-keyword
segments are modeled by extraneous garbage models/states $q_G$ (and grammatical constraints
ruling the possible keyword/non-keyword sequences).

[Figure 2: ROC using criterion (4) (double normalization), on 242 BREF test sentences
containing 100 keywords selected at random.]
Let us consider only the case of detecting one keyword per utterance at a time. In this case,
the keyword spotting problem amounts to matching the whole sequence $X$ of length $N$ onto
an extended HMM model $\overline{M}$ consisting of the states $\{q_G, q_1, \ldots, q_L, q_G\}$, in which a path
(of length $N$) is denoted
$\overline{Q} = \{\underbrace{q_G, \ldots, q_G}_{b-1}, q^b, q^{b+1}, \ldots, q^e, \underbrace{q_G, \ldots, q_G}_{N-e}\}$
with $(b-1)$ garbage states $q_G$ preceding $q^b$ and $(N-e)$ states $q_G$ following $q^e$, and
respectively emitting the vector sequences $X_1^{b-1}$ and $X_{e+1}^{N}$ associated with the
non-keyword segments.

Given some estimation of $P(q_G|x_n)$ (e.g., using probability density functions trained on
non-keyword utterances), the optimal path $\overline{Q}^*$ (and, consequently, $b^*$ and $e^*$) is then given
by:

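$$\overline{Q}^* = \operatorname*{argmin}_{\overline{Q}} -\log P(\overline{Q}|X) = \operatorname*{argmin}_{\overline{Q}} \Big\{ -\log P(Q|X_b^e) - \sum_{n=1}^{b-1} \log P(q_G|x_n) - \sum_{n=e+1}^{N} \log P(q_G|x_n) \Big\} \qquad (5)$$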

which can be solved by straightforward DP (since all paths have the same length). The main
problem of filler-based keyword spotting approaches is then to find ways to best estimate
$P(q_G|x_n)$ in order to minimize the error introduced by the approximations. Sometimes this
value was defined as the average of the N best local scores while, in other approaches, this
value is generated from explicit filler HMMs. However, these approaches will usually not
lead to the optimal solution given by (2).
BRIEF SUMMARY OF THE INVENTION

The invention belongs to the technical domain of decoding, classification, alignment and
matching of data.

The invention introduces a new method performing tasks in keyword spotting in
utterances, detection of subsequences in chains of organic matter (DNA and proteins) and
recognition of objects in images. The proposed methods search in an optimized way for the
matching that maximizes, over all the possible matchings, certain confidence measures based on
normalized posteriors. Three such confidence measures are used: two existed in previous work
in Speech Recognition, and the third is new.

Application fields for this invention are: man-machine interfaces (using speech
recognition; e.g., control systems, banking, flight services, etc.), coordination systems (for
industrial robots and automata) and development systems for pharmaceutical products.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Not Applicable

DETAILED DESCRIPTION OF THE INVENTION 

In the following, we show that it is possible to define an iterative process, referred to
as Iterating Viterbi Decoding (IVD), with good/fast convergence properties, estimating the
value of $P(q_G|x_n)$ such that straightforward DP (5) yields exactly the same segmentation
(and recognition results) as (3). While the same result could be achieved through a modified
DP in which all possible combinations (all possible begin/endpoints) would be taken into
account, the method proposed below is much more efficient (in terms of both CPU and memory
requirements).

Compared to previously devised sliding model methods, the first method proposed here
is based on: (A) A matching score defined as the average observation probability (posterior)
along the most likely state sequence. It is indeed believed that local posteriors are more
appropriate to the task. (B) The iteration of a Viterbi decoding, which does
not require scoring for all begin/endpoints or N-best rescoring, and which can be proved to
(quickly) converge to the optimal (from the point of view of the chosen scoring functions)
solution without requiring any specific filler models, using straightforward Viterbi alignments
(similar to regular filler-based KWS, but for some versions at the cost of a few iterations).

The IVD method is based on a criterion similar to that of the filler-based approaches (5), but
rather than looking for explicit (and empirical) estimates of $P(q_G|x_n)$ we aim at
mathematically estimating its value (which will be different and adapted to each utterance) such
that solving (5) is equivalent to solving (3). Thus, we perform an iterative estimation of
$P(q_G|x_n)$, such that the segmentation resulting from (5) is the same as what would be
obtained from (3). Defining $\varepsilon_t = -\log P(q_G|x_n)$ at iteration $t$, the proposed method can be
summarized as follows:

1. Start the first iteration, $t = 0$, from an initial value $\varepsilon_0 = \Pi$ (it is actually proven that
the iterative process presented here will always converge to the same solution, in more
or fewer cycles, with the worst-case upper bound of $N$ iterations, independently of this
initialization, e.g., with $\Pi$ equal to a cheap estimation of the score of a match).

In one of the developed versions, $\varepsilon_0$ is initialized to $-\log$ of the maximum of the local
probabilities $P(q_k|x_n)$ for each frame $x_n$.

An alternative choice is to initialize $\varepsilon_0$ to a pre-defined threshold score, $T$, that
expression (1) should reach to declare a keyword matching (see step 4 below). In this last
case, if $\varepsilon_1 > \varepsilon_0$ at the first iteration, then we can (as proven) directly infer that the
match will be rejected, otherwise it will be accepted.

2. Given the estimate $\varepsilon_t$ of $-\log P(q_G|x_n)$ at the current iteration $t$, find the optimal path
$(Q_t, b_t, e_t)$ according to (5) and matching the complete input.

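3. Estimate the value of $\varepsilon_{t+1}$ to be used in the next iteration as the average of the local
posteriors along the optimal path $Q_t$ (matching the $X_{b_t}^{e_t}$ resulting from (5) on the keyword
model), i.e.:

$$\varepsilon_{t+1} = \frac{-1}{e_t - b_t + 1} \log P(Q_t|X_{b_t}^{e_t}) \qquad (6)$$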

4. Increment $t$ and return to step 2, iterating until convergence is detected. If we are not
interested in the optimal segmentation, this process could also be stopped as soon as $\varepsilon_t$
reaches a value lower than a (pre-defined) minimum threshold, $T$, below which we can
declare that a keyword has been detected.

Correctness and convergence proofs of this process, and generalizations to other criteria, are
available: each IVD iteration (from the second iteration on) will decrease the value of $\varepsilon_t$, and the
final path yields the same solution as (3). The above method has a very good experimental
convergence speed (3-5 iterations in our tests). For one version of IVD (when $\varepsilon_0$ is initialized
using the acceptance threshold, $T$), the detection is decided after one single step.
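
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: it assumes a strictly left-to-right keyword model, all transition probabilities set to 1, at least as many frames as keyword states, and a hypothetical input `cost[n][s]` holding $-\log P(q_s|x_n)$. The names `viterbi_with_filler` and `iterating_viterbi_decoding` are invented for the example.

```python
import math

def viterbi_with_filler(cost, eps):
    """One pass of criterion (5): garbage frames, keyword, garbage frames.

    eps is the per-frame garbage cost -log P(q_G | x_n).
    Returns (keyword_cost, b, e) with 0-based inclusive frame indices.
    """
    N, L = len(cost), len(cost[0])
    INF = math.inf
    # prev[s] = (total cost, cost inside keyword, begin frame) at previous frame
    prev = [(INF, INF, -1)] * L
    # best path that has already left the keyword: (total, keyword cost, b, e)
    post = (INF, INF, -1, -1)
    for n in range(N):
        cur = []
        for s in range(L):
            cands = [prev[s]]                      # stay in state s
            if s > 0:
                cands.append(prev[s - 1])          # advance from state s-1
            else:
                cands.append((n * eps, 0.0, n))    # enter after n garbage frames
            total, kw, b = min(cands)
            if total < INF:
                cur.append((total + cost[n][s], kw + cost[n][s], b))
            else:
                cur.append((INF, INF, -1))
        # frame n emitted by trailing garbage: stay outside, or exit just now
        new_post = (post[0] + eps, post[1], post[2], post[3])
        if prev[L - 1][0] < INF:
            t, kw, b = prev[L - 1]
            new_post = min(new_post, (t + eps, kw, b, n - 1))
        post, prev = new_post, cur
    t, kw, b = prev[L - 1]
    _, kw, b, e = min(post, (t, kw, b, N - 1))     # keyword may end at last frame
    return kw, b, e

def iterating_viterbi_decoding(cost, eps0, max_iter=100):
    """Steps 1-4: re-estimate eps as in (6) until it stops changing."""
    eps = eps0
    for _ in range(max_iter):
        kw, b, e = viterbi_with_filler(cost, eps)
        new_eps = kw / (e - b + 1)                 # equation (6)
        if math.isclose(new_eps, eps, rel_tol=1e-12, abs_tol=1e-12):
            break
        eps = new_eps
    return new_eps, b, e
```

On a toy input such as `[[3.0, 3.0], [0.5, 2.0], [0.4, 1.5], [2.0, 0.3], [3.0, 2.5], [3.0, 3.0]]`, starting from different values of `eps0` leads to the same final segmentation, which is the behavior the text describes.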
A version with the same effort but suboptimal results is proposed in the following
paragraph. Let $T(M,X)$ be a matrix holding the HMM emission probabilities for an utterance
$X$ whose time-frames define the columns, and where the states of the hypothesized word
$W$ define the rows. When using the standard DP, one computes for each element of the
matrix $T(M,X)$ at frame $k$ of $X$ and state $s$ of $M$ three values: $S_{ks}$, $L_{ks}$ and $C_{ks}$, where
$S_{ks}$ corresponds to the sum of the entries on the optimal path that leads to the entry, $L_{ks}$
holds the length of the optimal path computed so far, and $C_{ks}$ is the estimation of the cost
on the optimal expanded path. By a path leading to an entry $T(k,s)$ we mean a sequence
of entries in the table $T$, such that there is exactly one entry for each time frame $t<k$. At
each entry $T(k,s)$, DP selects a locally optimal path denoted $P_{ks}$. At each step $k$, we consider
all pairs of entries of table $T(M,X)$ of type $T(k,s)$, $T(k-1,t)$. We update for each such
pair the current cost $C_{ks}$ (initially $\infty$), by comparing it with the alternative given by:

$$S_{ks} = S_{(k-1)t} - \log p(s|x_k)\,p(s|t)$$

$$L_{ks} = L_{(k-1)t} + 1, \quad \forall t > 0,\ t < L$$

$$C_{ks} = \frac{S_{ks}}{L_{ks}} \qquad (7)$$

wanting to have at step $k$, among the paths $P_{(k-1)t}$, the path $P_{ks}$ that minimizes $C_{ks}$. With
DP, one will choose the $P_{ks}$ with minimal $C_{ks}$.

This version can yield suboptimal results since the optimality principle is not respected
by expression (7). The optimality principle of Dynamic Programming requires that the
path to the frame $k-1$ that minimizes $C_{(k-1)t}$ also minimizes $C_{ks}$ for an entry at frame $k$ of
table $T(M,X)$.
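
A minimal Python sketch of this single-pass version is given below, to show where the greedy choice enters. It keeps, for each cell, only the predecessor with the smallest running average $S/L$, in the spirit of (7). The assumptions are the same as for a simple left-to-right model with transition probabilities set to 1 (so the $p(s|t)$ factor is dropped), and the input layout `cost[k][s]` $= -\log p(s|x_k)$ and the function name are hypothetical.

```python
import math

def greedy_normalized_dp(cost):
    """Single DP pass keeping one (S, L) pair per cell, chosen by S/L.

    Returns (C, b, e): the best normalized cost found at the last state,
    with 0-based inclusive begin/end frames. The result may be suboptimal
    because minimizing S/L cell by cell does not obey the DP optimality
    principle.
    """
    N, K = len(cost), len(cost[0])
    prev = [None] * K            # (S, L, begin) per state at the previous frame
    best = (math.inf, -1, -1)
    for k in range(N):
        cur = [None] * K
        for s in range(K):
            cands = []
            if s == 0:
                cands.append((0.0, 0, k))          # keyword starts at frame k
            if prev[s] is not None:
                cands.append(prev[s])              # stay in state s
            if s > 0 and prev[s - 1] is not None:
                cands.append(prev[s - 1])          # advance from state s-1
            if cands:
                ext = [(S + cost[k][s], L + 1, b) for S, L, b in cands]
                cur[s] = min(ext, key=lambda p: p[0] / p[1])  # C = S / L
        if cur[K - 1] is not None:
            S, L, b = cur[K - 1]
            best = min(best, (S / L, b, k))
        prev = cur
    return best
```

On easy inputs the greedy choice can coincide with the optimum of (3); the point of the text is that this is not guaranteed in general.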
Another technique that is suboptimal in time and/or quality is obtained from the previous
one by adopting a beam-search approach and a set of safe prunings. Dynamic Programming
can be viewed as a set of safe prunings that are applied at each entry of the DP table and
has the property that only one alternative is maintained. Dynamic Programming cannot be
used here, since the principle of optimality is not respected. The following types of safe
pruning are introduced by the present invention. We have proved that if at a frame $a$ we
have two paths $P'_a$ and $P''_a$ with $S''_a < S'_a$ and $L'_a < L''_a$, then at no frame $c > a$ will a
path $P''_c$ be forsaken for a path $P'_c$ if $P'_a \subset P'_c$, $P''_a \subset P''_c$ and
$P'_c \setminus P'_a = P''_c \setminus P''_a$. We will denote this order relation as $P''_a \prec P'_a$.
We have further shown that a path $P'$ may be safely discarded only when we know a lower
cost one, $P''$:

$$P'' \prec P' \Rightarrow C''_k < C'_k \qquad (8)$$

Thus, the method described in the following computes $S(M|X)$ and $Q^*$ from
equation (3). By ordering the set of paths according to relation (8), we only need to check
step (1.1) of the following method up to the eventual insertion place. The last paths are
candidates for pruning in step (1.2). In order for the pruning to be acceptable, we will prune
only paths that were too long on the last state. An additional counter for each path is
needed for storing the state length. This counter is reset when an entry from another row
is added and is incremented at each advance by one frame. The following steps detail this
method for a model $W$ and an utterance $X$:

a) Initialize all elements of a matrix, SetOfPaths[1..N, 1..K], to ∅

b) For all frames from 1 to N, for all states from 1 to K, for all candidates p_i in
SetOfPaths[frame-1, 1..K]:

- For all p_j in SetOfPaths[frame, state], if p_i < p_j then delete p_j (1.1), and if p_j < p_i
then continue step b) (1.2)

- Insert p_i in SetOfPaths[frame, state]

c) Select the best of the candidates in SetOfPaths[frame, K]
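
The insertion-with-pruning of step b) can be illustrated with the short Python sketch below. It represents each path only by its pair (S, L) of accumulated cost and length, and it uses one dominance test that is safe for a length-normalized cost with non-negative frame costs: a path with a sum that is not higher and a length that is not shorter can never end with a worse average once both paths receive the same extension. This test and the function names are assumptions chosen for the example; the exact order relation of the specification may differ in detail.

```python
def dominates(p, q):
    """True if path p = (S, L) can safely replace path q = (S, L).

    Assumed rule: p has a sum that is not higher and a length that is not
    shorter, so (S_p + d) / (L_p + x) <= (S_q + d) / (L_q + x) for any
    common extension with cost d >= 0 and x >= 0 extra frames.
    """
    return p[0] <= q[0] and p[1] >= q[1]

def insert_with_pruning(paths, new):
    """Steps (1.1)/(1.2): drop dominated entries, skip a dominated newcomer."""
    if any(dominates(old, new) for old in paths):
        return paths                                   # (1.2) newcomer is pruned
    kept = [old for old in paths if not dominates(new, old)]  # (1.1)
    return kept + [new]
```

Because several mutually non-dominated paths can survive in one cell, the table holds sets of paths rather than the single entry kept by ordinary DP.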

The next method builds on the previous technique and is a fast procedure for maximizing 
a more complex confidence measure that yields better results in practice. The corresponding 
confidence measure is defined as: 

$$\frac{1}{NVP} \sum_{h_i \in VP} \frac{\sum_{pst \in h_i} -\log(pst)}{length(h_i)} \qquad (9)$$

where NVP stands for the number of visited phonemes and VP stands for the set of visited
phonemes. An average is computed over all posteriors $pst$ of the emission probabilities for
the time frames matched to the visited phoneme $h_i$. The function $length(h_i)$ gives the
number of time frames matched against $h_i$. This method uses a breadth-first Beam Search.
It exploits a set of reduction rules and certain normalizations. For the state
$q_G$, in this method, the logarithm of the emission posterior is equal to zero. For each frame
$e$ and for each state $s$, the set of paths/probabilities of having the frame $e$ in the state $s$ is
computed as the first $\mathcal{N}$ maxima ($\mathcal{N}$ can be finite) of the confidence measure for all paths in
HMM $M$ of length $e$ and ending in the state $s$. The paths that, according to the reduction
rules, will lose the final race when compared with another already known path will be
deleted as well. Let us denote by $a_1$, $p_1$, $l_1$, respectively $a_2$, $p_2$ and $l_2$, the confidence measure for
the previously visited phonemes, the posterior in the current phoneme and the length in the
current phoneme for the path $Q_1$, respectively the path $Q_2$. The rules that can be used for
the reduction of the search space by discarding a path $Q_1$ for a path $Q_2$ are in this case any
of the following:

1. $l_2 > l_1$, $A > 0$, $B < 0$ and $L_c^2 A + L_c B + C > 0$

2. $l_2 > l_1$, $A > 0$, $B > 0$ and $C > 0$

3. $l_2 > l_1$, $A < 0$, $C > 0$ and $L^2 A + L B + C > 0$

4. $l_2 > l_1$, $A = 0$, $B < 0$ and $L B + C > 0$

where $A = a_1 - a_2$, $B = (a_1 - a_2)(l_1 + l_2) + p_1 - p_2$, $C = (a_1 - a_2) l_1 l_2 + p_1 l_2 - p_2 l_1$,
$L = L_{max} - \max\{l_1, l_2\}$, $L_c = -B/2A > 0$ and $L_{max}$ is the maximum acceptable length for a
phoneme. By discarding paths only if one of the above rules is satisfied, the optimum defined
by the confidence measure with double normalization can be guaranteed, if no phone may be
avoided by the HMM $M$. Any HMM may be decomposed into HMMs with this quality. The
4th rule is included in the 3rd and its test is useless if the latter was already checked.
The first test, $l_2 > l_1$, tells us if $Q_2$ has chances to eliminate $Q_1$; otherwise we will check
if $Q_1$ eliminates $Q_2$. These tests were inferred from the conditions of maintaining the final
maximal confidence measure while reduction takes place. In order to use the method of
double normalization without decomposing HMMs that skip some phonemes, the previous
rules are modified taking into account the number of visited phonemes for any path, $F_1$
respectively $F_2$, and the number of phonemes that may follow the current state. A simplified
test can be:

• $l_2 > l_1$, $A > 0$, $p_1 > p_2$, respectively $F_2 > F_1$ for the HMMs that skip phonemes.



This test is weaker than the 2nd reduction rule. For example, a path is eliminated by a second
path if the first one has an inferior confidence measure (higher in value) for the previous
phonemes, a shorter length, and the minus of the logarithm of the cumulated posterior in
the current phoneme also inferior (higher in value) to that of the second one. An additional
confidence measure based on the maximal length, $L_{max}$, and on the maximum of the minus
of the logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$, can be used
in order to limit the number of stored paths.

• $p > L_{max} P_{max}$ in any state

• $p/l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. These tests allow for
the elimination of the paths that are too long without being outstanding, respectively of
the paths with phonemes having unacceptable scores, otherwise compensated by very good
scores in other phonemes. If $\mathcal{N}$ is chosen equal to one, the aforementioned rules are no
longer needed, but we always propagate the path with the maximal current estimation of
the confidence measure. The obtained results are very good, even if the defined optimum is
guaranteed for this method only when $\mathcal{N}$ is bigger than the length of the sequence allowed
by $L_{max}$ or of the tested sequence. The same approach is valid for the simple normalization,
where the HMM for the searched word will be grouped into a single phoneme.

The present invention can exploit a newly designed confidence measure, named
Real Fitting, that represents the requirements of the recognition differently. Since the phonemes
and the absent states can be modeled by the HMMs used, we find it interesting to require the
fitting of each phoneme in the model to a section of the sequence. Therefore, we measure
the confidence level of a subsequence as being equal to the maximum, over all phonemes,
of the minus of the logarithm of the cumulated posterior of the phoneme, normalized by its
length:

$$\max_{\text{phoneme} \in \text{VisitedPhonemes}} \frac{\sum_{\text{phoneme}} -\log(\text{posteriors})}{\text{phoneme length}}$$

The rule that may be used in this framework for the reduction of the number of visited paths
is:

• $Q_2$ is discarded in favor of another path $Q_1$ if the confidence measure of the Real
Fitting for the previous phonemes is inferior (higher in value) for $Q_2$ compared with
$Q_1$, and if $p_1 < p_2$ and $l_2 < l_1$,

where $p_1$, $l_1$, respectively $p_2$, $l_2$, represent the minus of the logarithm of the cumulated
posterior, respectively the number of frames, in the current phoneme for the path $Q_1$, respectively
$Q_2$. Similarly to the previous method, the set of visited paths can be pruned by discarding
those where:

• $p > L_{max} P_{max}$ in any state

• $p/l > P_{max}$ at the output from a phoneme

where $p$ and $l$ are the values in the current phoneme for the minus of the logarithm of the
cumulated posterior and for the length of the path that is discarded. We recall that the
constants are the maximal length, $L_{max}$, and the accepted maximum
of the minus of the logarithm of the cumulated and normalized posterior in a phoneme, $P_{max}$.
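
For a fixed alignment, the Real Fitting measure and the two length/score limits can be written down in a few lines of Python. The sketch below is illustrative only: the input is assumed to be a list with one entry per visited phoneme, each entry being the list of per-frame values $-\log(\text{posterior})$, and the function names as well as the reading of the second limit as $p/l > P_{max}$ are assumptions.

```python
def real_fitting_score(neg_log_post_per_phoneme):
    """Maximum over visited phonemes of the length-normalized -log posterior."""
    return max(sum(frames) / len(frames) for frames in neg_log_post_per_phoneme)

def should_discard(p, l, l_max, p_max, leaving_phoneme):
    """Limits applied to the current phoneme of a path.

    p: cumulated -log posterior in the current phoneme, l: its length so far.
    """
    if p > l_max * p_max:                 # checked in any state
        return True
    return leaving_phoneme and p / l > p_max  # checked when leaving a phoneme

# Three visited phonemes with 2, 1 and 3 matched frames.
alignment = [[0.2, 0.4], [0.9], [0.3, 0.3, 0.6]]
worst = real_fitting_score(alignment)
```

Because the measure is a maximum rather than an average, a single badly fitting phoneme determines the score of the whole match.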
This invention thus proposes a new method for keyword spotting, based on recent
advances in confidence measures, using local posterior probabilities, but without requiring the
explicit use of filler models. A new method, referred to as Iterating Viterbi Decoding (IVD),
is proposed to solve the above optimization problem with a simple DP process (not requiring
the storage of pointers and scores for all possible ending and start times). Three other new
beam-search versions, corresponding to three different confidence measures, are also
proposed.

To summarize, the object of the invention consists of: 

• Method of recognition of a subsequence using a direct maximization of confidence
measures.

• The method of IVD for directly maximizing the confidence measures based on simple
normalization.

• The use of the confidence measure and method of recognition named Real Fitting,
based on individual fitting for each phoneme.

• Methods of recognition using simple and double normalization by combining these
measures with the additional confidence measures mentioned here, respectively the
maximal length and real matching limitation.

• The use of the aforementioned methods in keyword recognition.

• The use of the aforementioned methods in subsequence recognition of organic matter.

• The use of the aforementioned methods in recognition of objects in images.



DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Execution: The method can be performed using a personal computer or can be imple- 
mented in specialized hardware. 

1. A representation in the form of an HMM is obtained for the subsequences that are
looked for (word, protein profile, section of an image of the object).

2. A tool will be obtained (possibly trained, e.g., for speech recognition) for the
estimation of the posteriors. For example: multi-Gaussians, neural networks, clusters,
databases with Generalized Profiles and mutation matrices (PAM, BLOSUM, etc.).

3. One of the proposed versions should be implemented. They yield close
performance, but the method of Real Fitting coupled with a well-checked dictionary
should perform best.

For the first method (IVD):

(a) The classic Viterbi decoding is implemented with the modification that,
for each pair P = (sample, state), one propagates the time-frames of transition
between the state $q_G$ and the states of the HMM $M$ for the path that arrives at P.
These are inherited from the path that wins the entrance in the pair P, except
at the moment when their decision is taken, namely when they receive the index
of the corresponding sample.
(b) $w = -\log P(M|X_{b_t}^{e_t})$ is computed by subtracting from the cumulated posterior
that is returned by the Viterbi algorithm for the optimal path of iteration $t$ the value
$(N - (e_t - b_t + 1)) \cdot \varepsilon_t$ corresponding to the contribution of the states $q_G$, and dividing the
result by $e_t - b_t + 1$. The term $e_t - b_t + 1$ from the previous formula can be factored
outside the fraction.

(c) The initialization of $\varepsilon$ is made with an expected mean value. One can use the
$w$ that is computed when the state $q_G$ is associated with an emission posterior
equal to the average of the best $K$ emission probabilities of the current sample,
as done in the well-known on-line garbage model. In this case, $K$ is trained using
the corresponding technique.
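
The arithmetic of item (b) is simple enough to state as a one-line Python function. The sketch assumes that the Viterbi pass returns the total cumulated cost of the full-length path together with the begin and end frames of the keyword (0-based or 1-based, as long as both use the same convention). The function name and parameter names are hypothetical.

```python
def keyword_score_from_viterbi(total_cost, eps, n_frames, b, e):
    """Recover w = -log P(M | X_b^e) from the cost of the full-length path.

    total_cost: cumulated -log posterior of the whole path (keyword + garbage).
    eps: per-frame garbage cost used in this iteration.
    """
    length = e - b + 1
    garbage_cost = (n_frames - length) * eps   # contribution of the q_G states
    return (total_cost - garbage_cost) / length

# Six frames, keyword on frames 1..3, garbage cost 1.0 per frame.
w = keyword_score_from_viterbi(total_cost=4.2, eps=1.0, n_frames=6, b=1, e=3)
```

The returned value is the quantity that becomes the garbage cost of the next IVD iteration.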

The next Beam search methods are implemented according to the
description in the corresponding sections. For each pair P = (sample, state) one
computes for each corresponding path the sum and length in the last phoneme, as well
as the sum over the normalized cumulated posteriors of the previous phonemes (and
their number). Also, the entrance and exit samples into the HMM $M$ are computed
and propagated as in the previous method, in order to ensure the localization of the
subsequence.

If one searched entity (keyword, sequence, object) can have several HMM models, all 
of them are taken into consideration as competitors. This is the case of the words 
with several pronunciations (or of the objects that have different structures in different 
states, for the recognition in images). 

After the computation of the confidence measure for each model of the subsequences,
one eliminates those with a confidence measure that fails a threshold
trained for the configuration and the goal of the given application. For example, for
speech recognition with neural networks and minus of the logarithm of the posteriors,
the threshold is chosen at the desired point of the ROC curve obtained in tests.

5. The remaining alternatives are extracted in the order of their confidence measure,
with the elimination of the conflicting alternatives, until exhaustion. Each time
an alternative is eliminated, the searched entity with the corresponding HMM is
re-estimated for the remaining sections of the sequence in which the search is performed.
If the new confidence measure passes the threshold test, then it will be inserted
in the position corresponding to its score in the queue of alternatives.

6. The successful alternatives can undergo higher-level tests, for example a
confirmation question for speech recognition, the opinion of an operator, etc.

7. For object recognition in images:

Posteriors are obtained by computing a distance between the color of the model and
that of the element in the section of the image. If the context requires, the image will be
preprocessed to ensure a certain normalization (e.g., changing light conditions will
make a histogram-based transformation necessary).

The phonemes of speech recognition correspond to parts of the object. The
structure (existence of transitions and their probabilities) can be modified as a function of the
characteristics detected along the current path. For example, after detecting regions
of the object with certain lengths, one can estimate the expected length of the
remaining regions. Thus, the number of expected samples for the future states can be
established and the HMM attached to the object will be configured accordingly.

A direction is scanned for the detection of the best fitting and afterwards other
directions will be scanned for discovering new fittings, as well as for testing the previous
ones. The final test will be certified by classical methods such as cross-correlation or
by the analysis of the contours in the hypothesized position.

To mention some examples for the application of the proposed method:

• The recognition of keywords is beginning to be used in automated answering systems
for banking, as well as in telephone-based automated systems for control, sales or
information. The method offers a possibility to recognize keywords in spontaneous speech
with multiple speakers.

• The recognition of DNA sequences is important for the study of the human genome.
One of the biggest problems of the techniques involved is the large quantity of
data that has to be processed.

• The recognition of objects in images is used, among others, in cartography and in the
coordination of industrial robots. The method allows a quick estimation of the position
of the objects in scenes and can be validated with extra tests, using classical methods
of cross-correlation.